Abstract. A great many tools have been developed for supervised
classification, ranging from early methods such as linear discriminant
analysis through to modern developments such as neural networks and
support vector machines. A large number of comparative studies have been
conducted in attempts to establish the relative superiority of these 
methods. This paper argues that these comparisons often fail to take 
into account important aspects of real problems, so that the apparent 
superiority of more sophisticated methods may be something of an illu- 
sion. In particular, simple methods typically yield performance almost 
as good as more sophisticated methods, to the extent that the difference 
in performance may be swamped by other sources of uncertainty that 
generally are not considered in the classical supervised classification 
paradigm. 

Key words and phrases: Supervised classification, error rate, misclas- 
sification rate, simplicity, principle of parsimony, population drift, se- 
lectivity bias, flat maximum effect, problem uncertainty, empirical com- 
parisons. 



1. INTRODUCTION 

In supervised classification, one seeks to construct 
a rule which will allow one to assign objects to one of a 
prespecified set of classes based solely on a vector 
of measurements taken on those objects. Construc- 
tion of the rule is based on a "design set" or "train- 
ing set" of objects with known measurement vectors 
and for which the true class is also known: one essen- 
tially tries to extract from the design set the infor- 
mation which is relevant to distinguishing between 
the classes in terms of the given measurements. It 
is because the classes are known for the members 
of this initial data set that the term "supervised"
is used: it is as if a "supervisor" has provided these 
class labels. 

Such problems are ubiquitous and, as a conse- 
quence, have been tackled in several different re- 
search areas, including statistics, machine learning, 
pattern recognition, computational learning theory 
and data mining. As a result, a tremendous vari- 
ety of algorithms and models has been developed 
for the construction of such rules. A partial list in- 
cludes linear discriminant analysis, quadratic dis- 
criminant analysis, regularized discriminant analy- 
sis, the naive Bayes method, logistic discriminant 
analysis, perceptrons, neural networks, radial ba- 
sis function methods, vector quantization methods, 
nearest neighbor and kernel nonparametric meth- 
ods, tree classifiers such as CART and C4.5, sup- 
port vector machines and rule-based methods. New 
methods, new variants on existing methods and new 
algorithms for existing methods are being developed 
all the time. In addition, different methods for vari- 
able selection, handling missing values and other 
aspects of data preprocessing multiply the number 
of tools yet further. General theoretical advances
have also been made which have resulted in im- 
proved performance at predicting the class of new 
objects. These include ideas such as bagging, boost- 
ing and more general ensemble classifiers. Further- 
more, apart from the straightforward development 
of new rules, theory and practice have been devel- 
oped for performance assessment. A variety of crite- 
ria have been investigated, including measures based 
on the receiver operating characteristic (ROC) and 
Brier score, as well as the standard measure of mis- 
classification rate. Subtle estimators of these have 
been developed, such as jackknife, cross-validation 
and a variety of bootstrap methods, to overcome the 
potential optimistic bias which results from simply 
reclassifying the design set. 

An examination of recent conference proceedings 
and journal articles shows that such developments 
are continuing. In part this is because of new compu- 
tational developments that permit the exploration 
of new ideas, and in part it is because of the emer- 
gence of new application domains which present new 
twists on the standard problem. For example, in 
bioinformatics there are often relatively few cases 
but many thousands of variables. In such situations 
the risk of overfitting is substantial and new classes 
of tools are required. General references to work on 
supervised classification include [11, 13, 33, 38, 44]. 

The situation to date thus appears to be one of 
very substantial theoretical progress, leading to deep 
theoretical developments and to increased predictive 
power in practical applications. While all of these 
things are true, it is the contention of this paper that 
the practical impact of the developments has been 
inflated; that although progress has been made, it
may well not be as great as has been suggested. The 
arguments for this assertion are described in the fol- 
lowing sections. They develop ideas introduced by 
Hand [12, 14, 15, 18, 19] and Jamain and Hand [24]. 
The essence of the argument is that the improve- 
ments attributed to the more advanced and recent 
developments are small, and that aspects of real 
practical problems often render such small differ- 
ences irrelevant, or even unreal, so that the gains re- 
ported on theoretical grounds, or on empirical com- 
parisons from simulated or even real data sets, do 
not translate into real advantages in practice. That 
is, progress is far less than it appears. 

These ideas are described in four steps. 

First, model-fitting is a sequential process of progressive
refinement, which begins by describing the largest and most
striking aspects of the data structure,
and then turns to progressively smaller aspects
(stopping, one hopes, before the process begins to 
model idiosyncrasies of the observed sample of data 
rather than aspects of the true underlying distribu- 
tion). In Section 2 we show that this means that the 
large gains in predictive accuracy in classification 
are won using relatively simple models at the start of 
the process, leaving potential gains which decrease 
in size as the modeling process is taken further. All 
of this means that the extra accuracy of the more 
sophisticated approaches, beyond that attained by 
simple models, is achieved from "minor" aspects of 
the distributions and classification problems. 

Second, in Section 3 we argue that in many, per- 
haps most, real classification problems the data points 
in the design set are not, in fact, randomly drawn 
from the same distribution as the data points to 
which the classifier will be applied. There are many 
reasons for this discrepancy, and some are illustrated. 
It goes without saying that statements about classi- 
fier accuracy based on a false assumption about the 
identity of the design set distribution and the dis- 
tribution of future points may well be inaccurate. 

Third, when constructing classification rules, var- 
ious other assumptions and choices are often made 
which may not be appropriate and which may give 
misleading impressions of future classifier performance. 
For example, it is typically assumed that the classes 
are objectively defined, with no arbitrariness or un- 
certainty about the class labels, but this is some- 
times not the case. Likewise, parameters are often 
estimated by optimizing criteria which are not rele- 
vant to the real aim of classification accuracy. Such 
issues are described in Section 4 and, once again, it 
is obvious that these introduce doubts about how 
the claimed classifier performance will generalize to 
real problems. 

The phenomena with which we are concerned in 
Sections 3 and 4 are related to the phenomenon of 
overfitting. A model overfits when it models the de- 
sign sample too closely rather than modeling the dis- 
tribution from which this sample is drawn. In Sec- 
tions 3 and 4 we are concerned with situations in 
which the models may accurately reflect the design
distributions (so they do not underfit or overfit), but 
where they fail to recognize that these distributions, 
and the apparent classification problems described, 
are in fact merely a single such problem drawn from 
a notional distribution of problems. The real aim 
might be to solve a rather different problem. One 
might thus describe the issue as one of problem uncertainty.
To take a familiar example, which we do
not explore in detail in this paper because it has 
been explored elsewhere, the relative costs of differ- 
ent kinds of misclassification may differ and may be 
unknown. A very common resolution is to assume 
equal costs (Jamain and Hand [24] found that most 
comparative studies of classification rules made this 
assumption) and to use straightforward error rate 
as the performance criterion. However, equality is 
but one choice, and an arbitrary one at that, and 
one which we suspect is in fact rarely appropriate. 
In assuming equal costs, one is adopting a particu- 
lar problem which may not be the one which is re- 
ally to be solved. Indeed, things are even worse than 
this might suggest, because relative misclassification 
costs may change over time. Provost and Fawcett 
[36] have described such situations: "Comparison often
is difficult in real-world environments because
key parameters of the target environment are not 
known. The optimal cost/benefit tradeoffs and the 
target class priors seldom are known precisely, and 
often are subject to change (Zahavi and Levin [47]; 
Friedman and Wyatt [8]; Klinkenberg and Thorsten
[29]). For example, in fraud detection we cannot ig- 
nore misclassification costs or the skewed class dis- 
tribution, nor can we assume that our estimates are 
precise or static (Fawcett and Provost [6])." 

Moving on, our fourth argument is that classifi- 
cation methods are typically evaluated by report- 
ing their performance on a variety of real data sets. 
However, such empirical comparisons, while superfi- 
cially attractive, have major problems which are of- 
ten not acknowledged. In general, we suggest in Sec- 
tion 5 that no method will be universally superior 
to other methods: relative superiority will depend 
on the type of data used in the comparisons, the 
particular data sets used, the performance criterion 
and a host of other factors. Moreover, the relative 
performance will depend on the experience the per- 
son making the comparison has in using the meth- 
ods, and this experience may differ between meth- 
ods: researcher A may find that his favorite method 
is best, merely because he knows how to squeeze the 
best performance from this method. 

These various arguments together suggest that an 
apparent superiority in classification accuracy, ob- 
tained in "laboratory conditions," may not trans- 
late to a superiority in real-world conditions and, in 
particular, the apparent superiority of highly sophisticated
methods may be illusory, with simple methods often being
equally effective or even superior in
classifying new data points.

2. MARGINAL IMPROVEMENTS 

This section demonstrates that the extra perfor- 
mance to be achieved by more sophisticated classifi- 
cation rules, beyond that attained by simple meth- 
ods, is small. It follows that if aspects of the classi- 
fication problem are not accurately described (e.g., 
if incorrect distributions have been used, incorrect 
class definitions have been adopted, inappropriate 
performance comparison criteria have been applied, 
etc.), then the reported advantage of the more so- 
phisticated methods may be incorrect. Later sec- 
tions illustrate how some inaccuracies in the clas- 
sification problem description can arise. 

2.1 A Simple Example 

Statistical modeling is a sequential process in which 
one gradually refines the model to provide a better 
and better fit to the distributions from which the 
data were drawn. In general, the earlier stages in this 
process yield greater improvement in model fit than 
later stages. Furthermore, if one looks at the histor- 
ical development of classification methods, then the 
earlier approaches involve relatively simple struc- 
tures (e.g., the linear forms of linear or logistic dis- 
criminant analysis), while more recent approaches 
involve more complicated structures (e.g., the de- 
cision surfaces of neural networks or support vector 
machines). It follows that the simple approaches will 
have led to greater improvement in predictive per- 
formance than the later approaches which are nec- 
essarily trying to improve on the predictive perfor- 
mance obtained by the simpler earlier methods. Put 
another way, there is a law of diminishing returns. 

Although this paper is concerned with supervised 
classification problems, it is illuminating to examine 
a simple regression case. Suppose that we have a 
single response variable y which is to be predicted 
from $d$ variables $(x_1, \ldots, x_d)^T = \mathbf{x}$. Suppose also that
the correlation matrix of $(\mathbf{x}^T, y)^T$ has the form

(2.1)  $\Sigma = \begin{pmatrix} \Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22} \end{pmatrix}$

with $\Sigma_{11} = (1 - \rho) I + \rho \mathbf{1}\mathbf{1}^T$,
$\Sigma_{12} = \Sigma_{21}^T = \boldsymbol{\tau}$ and $\Sigma_{22} = 1$,
where $I$ is the $d \times d$ identity matrix,
$\mathbf{1} = (1, \ldots, 1)^T$ of length $d$ and
$\boldsymbol{\tau} = (\tau, \ldots, \tau)^T$ of length $d$.
That is, the correlation between each pair
of predictor variables is $\rho$, and the correlation
between each predictor variable and the response
variable is $\tau$. Suppose also that $\rho, \tau \ge 0$. This condition
is not necessary for the argument which follows; it
merely allows us to avoid some detail.

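Let $V(d)$ be the conditional variance of $y$ given the
values of $d$ predictor variables $\mathbf{x}$, as above. Standard
results give this conditional variance as

(2.2)  $V(d) = \Sigma_{22} - \Sigma_{21} \Sigma_{11}^{-1} \Sigma_{12}$.

Using the result that

(2.3)  $\Sigma_{11}^{-1} = \frac{1}{1 - \rho} \left[ I - \frac{\rho \mathbf{1}\mathbf{1}^T}{1 + (d - 1)\rho} \right]$

when $\Sigma_{11}$ is nonsingular, we obtain

(2.4)  $V(d) = 1 - \frac{\tau^2}{1 - \rho} \left[ d - \frac{\rho d^2}{1 + (d - 1)\rho} \right] = 1 - \frac{d \tau^2}{1 + (d - 1)\rho}$.

From this it follows that the reduction in conditional
variance due to adding an extra predictor variable,
$x_{d+1}$ (also correlated $\rho$ with the other predictors and
$\tau$ with the response variable), is

(2.5)  $X(d + 1) = V(d) - V(d + 1) = \frac{\tau^2 (1 - \rho)}{[1 + (d - 1)\rho][1 + d\rho]}$.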
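
As a numerical check on the diminishing returns described here, the following minimal sketch evaluates the closed form $V(d) = 1 - d\tau^2 / [1 + (d - 1)\rho]$, which follows from (2.2) under the equicorrelation structure assumed above, and tabulates the successive reductions $X(d + 1)$. The function names and the particular values of $\rho$ are illustrative assumptions, not part of the original analysis.

```python
def conditional_variance(d, rho, tau):
    """V(d): variance of y left unexplained by d equicorrelated predictors.

    Assumes every pair of predictors has correlation rho (< 1) and every
    predictor has correlation tau with the response.
    """
    return 1.0 - d * tau ** 2 / (1.0 + (d - 1) * rho)


def reduction(d, rho, tau):
    """X(d + 1) = V(d) - V(d + 1): gain from adding predictor d + 1."""
    return conditional_variance(d, rho, tau) - conditional_variance(d + 1, rho, tau)


tau = 0.5
for rho in (0.0, 0.3, 0.9):  # illustrative values only
    gains = [round(reduction(d, rho, tau), 4) for d in range(4)]
    print(f"rho={rho}: successive reductions {gains}")
```

With $\rho = 0$ each predictor removes the same amount, $\tau^2$, whereas for $\rho > 0$ the reductions shrink at every step.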
Note that the condition $-(d - 1)^{-1} < \rho < 1$ must
still be satisfied when $d$ is increased.
Now consider two cases:

Case 1. When the predictor variables are uncorrelated,
$\rho = 0$. From (2.5), we obtain $X(d + 1) = \tau^2$.
That is, if the predictor variables are mutually uncorrelated
and each has correlation $\tau$ with the response
variable, then each additional predictor reduces
the conditional variance of $y$ given
the predictors by $\tau^2$. [Of course, by setting $\rho = 0$ in
(2.4) we see that this is only possible up to $d = \tau^{-2}$
predictors. With this many predictors the conditional
variance of $y$ given $\mathbf{x}$ has been reduced to
zero.]

Case 2. $\rho > 0$. Plots of $V(d)$ for $\tau = 0.5$ and for a
range of $\rho$ values are shown in Figure 1. When there
is reasonably strong mutual correlation between the
predictor variables, the earliest ones contribute substantially
more to the reduction in variance remaining
unexplained than do the later ones. The case
$\rho = 0$ consists of a diagonal straight line running
from 1 down to zero. In the case $\rho = 0.9$, almost all
of the variance in the response variable that can be
explained is explained by the first chosen predictor.

Fig. 1. Conditional variance of response variable as additional
predictors are added for $\tau = 0.5$. A range of values of $\rho$
is shown.

This example shows that the reduction in conditional
variance of the response variable decreases
with each additional predictor we add, even though
each predictor has an identical correlation with the
response variable (provided this correlation is greater
than 0). The reason for the reduction is, of course,
the mutual correlation between the predictors: much
of the predictive power of a new predictor has already
been accounted for by the existing predictors.

In real applications, the situation is generally even
more pronounced than in this illustration. Usually,
in real applications, the predictor variables are not
identically correlated with the response, and the predictors
are selected sequentially, beginning with those
which maximally reduce the conditional variance. In
a sense, then, the example above provides a lower
bound on the phenomenon: in real applications the
proportion of the gains attributable to the early
steps is even greater.

2.2 Decreasing Bounds on Possible Improvement

We now return to supervised classification. For
illustrative purposes, suppose that misclassification
rate is the performance criterion, although similar
arguments apply with other criteria. Ignoring issues
of overfitting, adding additional predictor variables
cannot lead to an increase in misclassification rate.
The simplest model is that which uses no predictors,
leading, in the two-class case, to a misclassification
rate of $m_0 = \pi_0$, where $\pi_0$ is the prior probability
of the smaller class. Suppose that a predictor variable
is now introduced which has the effect of reducing
the misclassification rate to $m_1 < m_0$. Then
the scope for further improvement is only $m_1$, which
is less than the original scope $m_0$. Furthermore, if
$m_1 < m_0 - m_1$, then all future additions necessarily
improve things by less than the first predictor
variable. In fact, things are even more extreme than
this: one cannot further reduce the misclassification
rate by more than $m_1 - m_b$, where $m_b$ is the Bayes
error rate. To put it another way, at each step the
maximum possible increase in predictive power decreases,
so it is not surprising that, in general, at
each step the additional contribution to predictive
power decreases.

2.3 Effectiveness of Simple Classifiers 

Although the literature contains examples of ar- 
tificial data which simple models cannot separate 
(e.g., intertwined spirals or checkerboard patterns), 
such data sets are exceedingly rare in real life. Con- 
versely, in the two-class case, although few real data 
sets have exactly linear decision surfaces, it is com- 
mon to find that the centroids of the predictor vari- 
able distributions of the classes are different, so that 
a simple linear surface can do surprisingly well as an 
estimate of the true decision surface. This may not 
be the same as "can do surprisingly well in classify- 
ing the points," since in many problems the Bayes 
error rate is high, meaning that no decision sur- 
face can separate the distributions of such problems 
very well. However, it means that the dramatic steps
in improvement in classifier accuracy are made in
the simple first steps. This is a phenomenon which 
has been noticed by others (e.g., Rendell and Se- 
shu [37]; Shavlik, Mooney and Towell [41]; Mingers 
[34]; Weiss, Galen and Tadepalli [45]; Holte [22]).
Holte [22], in particular, carried out an investigation 
of this phenomenon. His "simple classifier" (called 
1R) consists of a partition of a single variable, with
each cell of the partition possibly being assigned to 
a different class: it is a multiple-split single-level tree 
classifier. A search through the variables is used to 
find that which yields the best predictive accuracy. 
Holte compared this simple rule with C4.5, a more 
sophisticated tree algorithm, finding that "on most 
of the datasets studied, 1R's accuracy is about 3
percentage points lower than C4's." 

We carried out a similar analysis. Perhaps the 
earliest classification method formally developed is 
Fisher's linear discriminant analysis [7]. Table 1 shows 
misclassification rates for this method and for the 
best performing method we could find in a search 
of the literature (these data were abstracted from 
the data accumulated by Jamain [23] and Jamain 
and Hand [24]) for a randomly selected sample of 
ten data sets. The first numerical column shows the
misclassification rate of the best method we found
($m_T$), the second shows that of linear discriminant
analysis ($m_L$), the third shows the default rule of assigning
every point to the majority class ($m_0$) and
the final column shows the proportion of the difference
between the default rule and the best rule
which is achieved by linear discriminant analysis
[$(m_0 - m_L)/(m_0 - m_T)$]. It is likely that the best
rules, being the best of rules which many researchers 
have applied, are producing results near the Bayes 
error rate. 



Table 1

Performance of linear discriminant analysis and the best result we found on ten
randomly selected data sets

Data set            Best method e.r.   Lindisc e.r.   Default rule   Prop linear
Segmentation        0.0140             0.083          0.760          0.907
Pima                0.1979             0.221          0.350          0.848
House-votes16       0.0270             0.046          0.386          0.948
Vehicle             0.1450             0.216          0.750          0.883
Sat image           0.0850             0.160          0.758          0.889
Heart Cleveland     0.1410             0.141          0.560          1.000
Splice              0.0330             0.057          0.475          0.945
Waveform21          0.0035             0.004          0.667          0.999
Led7                0.2650             0.265          0.900          1.000
Breast Wisconsin    0.0260             0.038          0.345          0.963

The striking thing about Table 1 is the large values
of the percentages of classification accuracy gained
by simple linear discriminant analysis. The lowest 
percentage is 85% and in most cases over 90% of 
the achievable improvement in predictive accuracy, 
over the simple baseline model, is achieved by the 
simple linear classifier. 
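
The final column of Table 1 can be recomputed from the other three. The sketch below does so using the error rates transcribed from the table; the variable and function names are illustrative assumptions. Because the published error rates are rounded, the recomputed proportions agree with the printed column only to within about 0.001.

```python
# Rows transcribed from Table 1:
# (data set, best method e.r. m_T, Lindisc e.r. m_L, default rule e.r. m_0)
TABLE_1 = [
    ("Segmentation", 0.0140, 0.083, 0.760),
    ("Pima", 0.1979, 0.221, 0.350),
    ("House-votes16", 0.0270, 0.046, 0.386),
    ("Vehicle", 0.1450, 0.216, 0.750),
    ("Sat image", 0.0850, 0.160, 0.758),
    ("Heart Cleveland", 0.1410, 0.141, 0.560),
    ("Splice", 0.0330, 0.057, 0.475),
    ("Waveform21", 0.0035, 0.004, 0.667),
    ("Led7", 0.2650, 0.265, 0.900),
    ("Breast Wisconsin", 0.0260, 0.038, 0.345),
]


def proportion_linear(m_t, m_l, m_0):
    """Share of the default-to-best improvement achieved by LDA."""
    return (m_0 - m_l) / (m_0 - m_t)


for name, m_t, m_l, m_0 in TABLE_1:
    print(f"{name:18s} {proportion_linear(m_t, m_l, m_0):.3f}")
```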

I am grateful to Willi Sauerbrei for pointing out 
that when the error rates of both the best method 
and the linear method are small, the large propor- 
tion in achievable accuracy which can be obtained 
by the linear method corresponds to the error rate 
of the linear method being a large multiple of that 
of the best method. For example, in the most ex- 
treme case in Table 1, the results for the segmenta- 
tion data show that the linear discrimination error 
rate is nearly six times that of the best method. On 
the other hand, when the error rates are small, this 
large difference will correspond to only a small pro- 
portion of new data points. Small differences in error 
rate are susceptible to the issues raised in Sections 3 
and 4: they may vanish when problem uncertainties 
are taken into account. 

2.4 The Flat Maximum Effect 

Even within the context of classifiers defined in 
terms of simple linear combinations of the predictor 
variables, it has often been observed that the ma- 
jor gains are made by (for example) weighting the 
variables equally, with only small further gains to
be had by careful optimization of the weights. This 
phenomenon has been termed the flat maximum ef- 
fect [13, 43]: in general, often quite large deviations 
from the optimal set of weights will yield predictive 
performance not substantially worse than the opti- 
mal weights. An informal argument that shows why 
this is often the case is as follows. 

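Let the predictor variables be $(x_1, \ldots, x_d)^T = \mathbf{x}$
and, for simplicity, assume that $E(x_i) = 0$ and
$V(x_i) = 1$ for $i = 1, \ldots, d$. Let $\Sigma = \{r_{ij}\}$ be the
correlation matrix between these variables. Now define
two weighted sums

$w = \sum_{i=1}^{d} w_i x_i$ and $v = \sum_{i=1}^{d} v_i x_i$,

using respective weight vectors $(w_1, \ldots, w_d)$ and $(v_1, \ldots, v_d)$.
In general, $r(w, v)$, the correlation between $w$ and $v$,
can take extreme values of $+1$ and $-1$, but suppose
we restrict the weights to be nonnegative, $w_i, v_i \ge 0$
for $i = 1, \ldots, d$, and also require $\sum_i w_i = 1$ and $\sum_i v_i = 1$.
Using these conditions, a little algebra shows that

$r(v, w) \ge \sum_i \sum_j v_i w_j r(x_i, x_j)$

(the variances of $v$ and $w$ cannot exceed 1, so this holds
whenever the right-hand side is nonnegative).

Now, with equal weights, $v_i = 1/d$, $i = 1, \ldots, d$, we
obtain

$r(v, w) \ge \frac{1}{d} \sum_i \sum_j w_j r(x_i, x_j) \ge \frac{1}{d} \sum_i \sum_j w_j r(x_i, x_k)$,

where $k = \arg\min_j \sum_i r(x_i, x_j)$.
From this,

$r(v, w) \ge \frac{1}{d} \sum_i \sum_j w_j r(x_i, x_k) = \frac{1}{d} \sum_i r(x_i, x_k)$.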

In words, the correlation between an arbitrary weighted 
sum of the x variables (with weights summing to 
1) and the simple combination using equal weights 
is bounded below by the smallest row average of 
the entries in the correlation matrix of the x vari- 
ables. Hence if the correlations are all high, the sim- 
ple average will be highly correlated with any other 
weighted sum: the choice of weights will make little 
difference to the scores. The gain to be made by the 
extra effort of optimizing the weights may not be 
worth the effort. 
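
The bound can be checked numerically. The following sketch, which uses a hypothetical 3 x 3 correlation matrix chosen purely for illustration, draws random nonnegative weight vectors summing to 1 and compares the correlation between each weighted sum and the equal-weights sum with the smallest row average of the correlation matrix. All names and numbers in it are assumptions introduced for the example.

```python
import math
import random


def quad(a, R, b):
    """Bilinear form a' R b, the covariance of two weighted sums."""
    return sum(a[i] * R[i][j] * b[j] for i in range(len(a)) for j in range(len(b)))


def correlation_of_sums(v, w, R):
    """Correlation between sum_i v_i x_i and sum_i w_i x_i."""
    return quad(v, R, w) / math.sqrt(quad(v, R, v) * quad(w, R, w))


def smallest_row_average(R):
    """Lower bound from the text: the smallest row average of R."""
    d = len(R)
    return min(sum(row) / d for row in R)


# Hypothetical correlation matrix with uniformly high correlations.
R = [
    [1.00, 0.80, 0.70],
    [0.80, 1.00, 0.75],
    [0.70, 0.75, 1.00],
]
d = len(R)
equal = [1.0 / d] * d
bound = smallest_row_average(R)

rng = random.Random(0)  # fixed seed so the run is repeatable
worst = 1.0
for _ in range(1000):
    raw = [rng.random() for _ in range(d)]
    total = sum(raw)
    w = [x / total for x in raw]  # nonnegative weights summing to 1
    worst = min(worst, correlation_of_sums(equal, w, R))

print(f"lower bound: {bound:.4f}, smallest correlation observed: {worst:.4f}")
```

In this example every randomly weighted sum stays at or above the bound, which is the sense in which the maximum is flat.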

Fig. 2. Effect on misclassification rate of increasing the
number of hidden nodes in a neural network to predict the
class of the sonar data.

Fig. 3. Effect on misclassification rate of increasing the
number of leaves in a tree classifier to predict the class of
the sonar data.

2.5 An Example 

As a simple illustration of how increasing model 
complexity leads to a decreasing rate of improve- 
ment, we fitted models to the sonar data from the 
University of California, Irvine (UCI) data base. This 
data set consists of 208 observations, 111 of which
belong to the class "metal" and 97 of which belong 
to the class "rock." There are 60 predictor variables. 
The data were randomly divided into two parts, and 
a succession of neural networks with increasing num- 
bers of hidden nodes was fitted to half of the data, 
with the other half being used as a test set. The er- 
ror rates are shown in Figure 2. The left-hand point, 
corresponding to 0 nodes, is the baseline misclassification
rate achieved by assigning everyone in the
test set to the larger class. The error bars are 95% 
confidence intervals calculated from 100 networks in 
each case. Figure 3 shows a similar plot, but this 
time for a recursive partitioning tree classifier ap- 
plied to the same data. The horizontal axis shows 
increasing numbers of leaf nodes. Standard methods 
of tree construction were used, in which a large tree 
is pruned back to the requisite number of nodes. In 
both of these figures we see the dramatic improve- 
ment arising from fitting the first nontrivial model. 
This far exceeds the subsequent improvement ob- 
tained in any later step. 

3. DESIGN SAMPLE SELECTION 

Intrinsic to the classical supervised classification 
paradigm is the assumption that the data in the 
design set are randomly drawn from the same dis- 
tribution as the points to be classified in the future. 



Sometimes slight variants of the sampling scheme 
are used, for example, drawing samples separately 
from each class, but the assumption that future points 
to be classified are drawn from the same distribu- 
tions as the design set is always made. Unfortu- 
nately, as we illustrate in this section, there are sev- 
eral reasons why this assumption may not be justi- 
fied. In fact, as with our suggestion that the com- 
mon choice of equal misclassification costs may be 
more often inappropriate than appropriate, we sus- 
pect that the assumption that the design distribu- 
tion is representative of the distribution from which 
future points will be drawn is perhaps more often 
incorrect than correct. 

If the distribution underlying the design data and 
that underlying future points to be classified do dif- 
fer, then elaborate optimization of the classifier us- 
ing the design data may be wasted effort: the per- 
formance difference between two classifiers may be 
irrelevant in the context of the differences arising 
between the design and future distributions. In par- 
ticular, we suggest, more sophisticated classifiers, 
which almost by definition model small idiosyncrasies 
of the distribution underlying the design set, will 
be more susceptible to wasting effort in this way: 
the grosser features of the distributions (modeled 
by simpler methods) are more likely to persist than 
the smaller features (modeled by the more elaborate 
methods). 

3.1 Population Drift 

A fundamental assumption of the classical paradigm
is that the various distributions involved do not change
over time. In fact, in many applications this is unrealistic
and the population distributions are nonstationary.
For example, it is unrealistic in most commercial
applications concerned with human behavior:
customers will change their behavior with price
changes, with changes to products, with changing
competition and with changing economic conditions.
Hoadley [21] remarked "the test sample is supposed
to represent the population to be encountered in the
future. But in reality, it is usually a random sample
of the current population. High performance on the
test sample does not guarantee high performance on
future samples, things do change" and "there is always
a chance that a variable and its relationships
will change in the future. After that, you still want
the model to work. So don't make any variable dominant."
He is cautioning against making the model
fit the design distribution too well. The last point
about not making any variable dominant is related
to the flat maximum effect, described above.

Fig. 4. Evolution of misclassification rate of a classifier built
at the start of the period.

Among the most important reasons for changes 
to the distribution of applicants are changes in mar- 
keting and advertising practices. Changes to the dis- 
tributions that describe the customers explain why, 
in the credit scoring and banking industries [16, 20, 
39, 42], the classification rules used to predict which 
applicants are likely to default on loans are updated 
every few months: their performance degrades, not 
because the rules themselves change, but because 
the distributions to which they are being applied 
change [27]. 

An example of this is given in Figure 4. The avail- 
able data consisted of the true classes ("bad" or 
"good") and the values of 17 predictor variables 
for 92,258 customers taking out unsecured personal 
loans with a 24-month term given by a major UK 
bank during the period 1 January 1993 to 30 Novem- 
ber 1997; 8.86% of the customers belonged to the 
bad class. The figure shows how the misclassifica- 
tion rate for a classification rule built on data just 
preceding the start of the displayed period changed 
over time. Since the coefficients of the classifier were 
not changing, the deterioration in performance must 
be due to shifts in the distributions of customers 
over time. 

An illustration of how this "population drift" phe- 
nomenon affects different classifiers differentially is 
given in Figure 5. For the purposes of this illustra- 
tion we used a linear discriminant analysis (LDA) 
as a simple classifier and a tree model as a more 
complicated classifier. For the design set we used 
customers 1,3,5,7,..., 4999. We then applied the 
classifiers to alternate customers, beginning with the
second, up to the 60,000th customer. This meant
that different customers were used for designing and 
testing, even during the initial period, so that there 
would be no overfitting in the reported results. Fig- 
ure 5 shows lowess smooths of the misclassification 
cost [i.e., misclassification rate, with customers from
each class weighted so that $c_0/c_1 = \pi_1/\pi_0$, where $c_i$
is the cost of misclassifying a customer from class
$i$ and $\pi_i$ is the prior (class size) of class $i$]. As can
be seen from the figure, the tree classifier (the lower 
curve) is initially superior (has smaller loss), but af- 
ter a time its superiority begins to fade. Superficial 
examination of the figure might suggest that the ef- 
fect takes a long time to become apparent, not re- 
ally manifesting itself until around the 40,000th cus- 
tomer, but consider that, in an application such as 
this, the data are always retrospective. In the present 
case, one cannot determine the true class until the 
entire 24-month loan term has elapsed. [In fact, of
course, this is not quite true: if a customer defaults 
before the end of the term, then their class (bad) is 
known, but otherwise their true (good or bad) class 
is not known until the end, so that to obtain an unbi- 
ased sample, one has to wait until the end. Survival 
analysis models can be constructed to allow for this, 
but that is leading us away from the point.] For our 
problem, to accumulate an unbiased sample of 5000 
customers with known true outcome, one would have 
to wait until two years after the 5000th customer 
had been accepted. In terms of the horizontal axis in 
Figure 5, this means that the model would be built, 
and would be initially used at around the time that 
the 40,000th customer was being considered. The 
figure shows that this is just when the model degrades.
The changes in population structure which
occurred during the two years which elapsed while
we waited for the true classes of the 5000 design set
customers to become known have reduced any advantage
that the more sophisticated tree model may
have.

Fig. 5. Lowess smooths of cost-weighted misclassification
rate for a tree model and LDA applied to customers
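
The cost weighting used for Figure 5 can be illustrated with a minimal sketch. It assumes two classes labeled 0 and 1 and priors estimated from the test labels themselves (an assumption; the text does not say how the priors were obtained). Choosing c_0/c_1 = pi_1/pi_0 amounts to taking c_i proportional to 1/pi_i, so the weighted loss reduces to the average of the two per-class error rates. The function name `cost_weighted_error` and the data are hypothetical.

```python
def cost_weighted_error(y_true, y_pred):
    """Misclassification rate with class i weighted by c_i = 1 / pi_i.

    Assumes labels are 0/1 and that both classes occur in y_true.
    The result is scaled to lie in [0, 1].
    """
    n = len(y_true)
    counts = {0: 0, 1: 0}
    for y in y_true:
        counts[y] += 1
    priors = {c: counts[c] / n for c in counts}    # pi_i (assumed: estimated from labels)
    costs = {c: 1.0 / priors[c] for c in counts}   # c_i, so that c_0/c_1 = pi_1/pi_0
    weighted = sum(costs[y] for y, p in zip(y_true, y_pred) if y != p)
    return weighted / (2 * n)


# Illustrative labels: 90 "good" (0) and 10 "bad" (1) customers,
# scored by a rule that always predicts "good".
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
plain_rate = sum(y != p for y, p in zip(y_true, y_pred)) / len(y_true)
weighted_rate = cost_weighted_error(y_true, y_pred)
```

With unbalanced classes such as these, the plain misclassification rate rewards a rule that ignores the rare class, whereas the weighted version does not, which is presumably why a cost-weighted measure is the more informative one for data with a small bad class.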

In summary, the apparent superiority of the more 
sophisticated tree classifier over the very simple lin- 
ear discriminant classifier is seen to fade when we 
take into account the fact that the classifiers must 
necessarily be applied in the future to distributions 
which are likely to have changed from those which 
produced the design set. Since, as demonstrated in 
Section 2, the simple linear classifier captures most 
of the separation between the classes, the additional 
distributional subtleties captured by the tree method 
become less and less relevant when the distributions 
drift. Only the major aspects are still likely to hold. 

The impact of population drift on supervised clas- 
sification rules is nicely described by the Ameri- 
can philosopher Eric Hoffer, who said, "In times of 
change, learners inherit the Earth, while the learned 
find themselves beautifully equipped to deal with a 
world that no longer exists." 

3.2 Sample Selectivity Bias 

The previous subsection considered the impact on 
classification rules of distributions which changed 
over time. There is little point in optimizing the rule 
to the extent that it models aspects of the distribu- 
tions and decision surface which are likely to have 
changed by the time the rule is applied. Similar futil- 
ity applies if a selection process means that the de- 
sign sample is drawn from a distribution distorted 
in some way from that to which the classification 
rule is to be applied. In fact, I suspect that this may 
be common. Consider, for example, a classification 
rule aimed at differential medical diagnosis or med- 
ical screening. The rule will have been developed on 
a sample of cases (including members of each class). 
Perhaps these cases will be drawn from a particu- 
lar hospital, clinic or health district. Now all sorts 
of demographic, social, economic and other factors 
influence who seeks and is accepted for treatment, 
how severe the cases being treated are, how old they 
are and so on. In general, it would be risky to as- 
sume that these selection criteria are the same for 
all hospitals, clinics or health districts. This means 
that the fine points of the classification rule are un- 
likely to hold. One might expect its coarser features 
to be true across different such sets of cases, but
the detailed aspects will reflect particular properties
of the population from which the design data
were drawn. In fact, there are some subtleties here. 
Suppose that the classification rule follows the diagnostic
paradigm [directly modeling p(c|x), the probability
of class membership, c, given the descriptor
vector x], rather than the sampling paradigm [which
models p(c|x) indirectly from the p(x|c) using Bayes'
theorem]. Then if x spans the space of all predictors
of class membership and if the model form chosen 
for p(c|x) includes the "true" model, then sampling 
distortions based on x alone will not adversely in- 
fluence the classifier: the classifier built in one clinic 
will also apply elsewhere. Of course, it would be a 
brave person who could confidently assert that these 
two conditions held. Such subtleties aside, what this 
means, again, is that effort spent on overrefining the 
classification model is probably wasted effort and, 
in particular, that fine differences between different 
classification rules should not be regarded as carry- 
ing much weight. 
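
The subtlety about selection on x alone can be checked with exact arithmetic on a toy population. The sketch below is only an illustration: the single binary predictor, the probabilities and the selection rates are all invented, and the function names are hypothetical. It shows that selection depending only on x leaves p(c|x) unchanged while altering p(x|c), and that selection depending on the class itself distorts p(c|x).

```python
from fractions import Fraction as F

# Hypothetical population: one binary predictor x and class c in {1, 2}.
p_x = {"low": F(1, 2), "high": F(1, 2)}            # p(x)
p_c1_given_x = {"low": F(1, 10), "high": F(3, 5)}  # p(c=1 | x)


def selected_joint(select):
    """Joint distribution of (x, c) after selection with probability select(x, c)."""
    raw = {}
    for x in p_x:
        for c in (1, 2):
            p_c = p_c1_given_x[x] if c == 1 else 1 - p_c1_given_x[x]
            raw[(x, c)] = p_x[x] * p_c * select(x, c)
    total = sum(raw.values())
    return {key: value / total for key, value in raw.items()}


def posterior_c1(joint, x):
    """p(c=1 | x) under the given joint distribution."""
    return joint[(x, 1)] / (joint[(x, 1)] + joint[(x, 2)])


def x_given_c1(joint, x):
    """p(x | c=1) under the given joint distribution."""
    return joint[(x, 1)] / sum(joint[(v, 1)] for v in p_x)


population = selected_joint(lambda x, c: F(1))
# Selection depending on x alone (e.g., one clinic sees mostly "high" cases).
by_x = selected_joint(lambda x, c: F(9, 10) if x == "high" else F(1, 5))
# Selection that also depends on the class itself.
by_class = selected_joint(lambda x, c: F(9, 10) if c == 1 else F(1, 5))

for name, joint in (("population", population), ("by x", by_x), ("by class", by_class)):
    print(name, posterior_c1(joint, "low"), posterior_c1(joint, "high"),
          x_given_c1(joint, "high"))
```

In practice one rarely knows that selection depends on the recorded predictors alone, which is the point made in the text: the two conditions are hard to assert with confidence.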

This problem of sample selection and how it might 
be tackled has been the subject of intensive research, 
especially by the medical statistics and economet- 
rics communities, but appears not to have been of 
great concern to researchers on classification meth- 
ods. Having said that, one area that involves sample 
selectivity in classification problems which has at- 
tracted research interest arises in the retail financial 
services industry, as in the previous section. Here, as 
in that section, the aim is to predict, for example, 
on the basis of application and other background 
variables, whether or not an applicant is likely to 
be a good customer. Those expected to be good are 
accepted, and those expected to be bad are rejected. 
For those that have been accepted, we subsequently 
discover their true good or bad class. For the re- 
jected applicants, however, we never know whether 
they are good or bad. The consequence is that the 
resulting sample is distorted as a sample from the 
population of applicants, which is our real interest 
for the future. Measuring the performance or at- 
tempting to build an improved classification rule us- 
ing those individuals for which we do know the true 
class (which is needed for supervised classification) 
has the potential to be highly misleading for the 
overall applicant population. In particular, it means 
that using highly sophisticated methods to squeeze 
subtle information from the design data is pointless. 
This problem is so ubiquitous in the personal finan- 
cial services sector that it has been given its own 
name — reject inference [17]. 

4. PROBLEM UNCERTAINTY

Section 3 looked at mismatches between the dis- 
tributions modeled by the classification rule and the 
distributions to which it was applied. This is an ob- 
vious way in which things may go awry, but there 
are many others, perhaps not so obvious. This sec- 
tion illustrates just three. 

4.1 Errors in Class Labels 

The classical supervised classification paradigm is 
based on the assumption that there are no errors 
in the true class labels. If one expects errors in the 
class labels, then one can attempt to build models 
which explicitly allow for this, and there has been 
work to develop such models. Difficulties arise, how- 
ever, when one does not expect such errors, but they 
nevertheless occur. 

Suppose that, with two classes, the true posterior
class probabilities are $p(1|x)$ and $p(2|x)$, and that
a (small) proportion $\delta$ of each class is incorrectly
believed to come from the other class at each $x$.
Denoting the apparent posterior probability of class
1 by $p^*(1|x)$, we have

$$p^*(1|x) = (1-\delta)\,p(1|x) + \delta\,p(2|x).$$

It follows that if we let $r(x) = p(1|x)/p(2|x)$ denote
the true odds and let $r^*(x) = p^*(1|x)/p^*(2|x)$ denote
the apparent odds, then

$$r^*(x) = \frac{r(x) + \varepsilon}{\varepsilon\, r(x) + 1} \qquad (4.1)$$

with $\varepsilon = \delta/(1-\delta)$.

With small $\varepsilon$, (4.1) is monotonic increasing in
$r(x)$, so that contours of $r(x)$ map to corresponding
contours of $r^*(x)$. In particular, if the true optimal
decision surface is $r(x) = k$ ($k$ is determined
by the relative misclassification costs), then the optimal
decision surface when errors are present is given
by $r^*(x) = k^*$, with $k^* = (k + \varepsilon)/(\varepsilon k + 1)$. Unfortunately,
if the occurrence of mislabeling is unsuspected,
then $r^*(x)$ will be compared with $k$ rather
than $k^*$. In the case of equal misclassification costs,
so that $k = 1$, we have $k^* = k = 1$, so that no problems
arise from the misclassification. (Indeed, advantages
can even arise: see [9].) However, what happens
if $k \neq 1$? It is easy to show that $r^*(x) > r(x)$
whenever $r(x) < 1$ and that $r^*(x) < r(x)$ whenever
$r(x) > 1$. That is, the effect of the errors in class labels
is to shrink the posterior class odds toward 1, so
that comparing $r^*(x)$ with $k$ rather than $k^*$ is likely
to lead to worse performance. There is also a secondary
issue, that the shrinkage of $r(x)$ will make
it less easy to estimate the decision surface accu- 
rately because it is a flatter surface: the variance 
of the estimated decision surface, from sample to 
sample, will be greater when there is mislabeling of 
classes. In such circumstances it is better to stick to 
simpler models, since the higher order terms of the 
more complicated models will be very inaccurately 
estimated. 
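
The shrinkage of the odds toward 1 is easy to verify numerically. The following sketch assumes the mislabeling model of this subsection, in which the apparent posterior is p*(1|x) = (1 - delta) p(1|x) + delta p(2|x), and compares the apparent odds computed directly from that definition with the closed form r* = (r + eps)/(eps r + 1), eps = delta/(1 - delta). The function names and the value of delta are illustrative assumptions.

```python
def apparent_odds(r, delta):
    """Apparent odds r* when a proportion delta of each class is mislabeled."""
    eps = delta / (1 - delta)
    return (r + eps) / (eps * r + 1)


def apparent_threshold(k, delta):
    """Threshold k* with which r* should be compared in place of k."""
    return apparent_odds(k, delta)


def odds_from_posteriors(p1, delta):
    """r* computed directly from p*(1|x) = (1 - delta) p(1|x) + delta p(2|x)."""
    p1_star = (1 - delta) * p1 + delta * (1 - p1)
    return p1_star / (1 - p1_star)


delta = 0.05  # illustrative mislabeling proportion (an assumption)
for p1 in (0.1, 0.5, 0.9):
    r = p1 / (1 - p1)
    print(f"r = {r:.3f}  r* = {apparent_odds(r, delta):.3f}")

# With unequal costs the correct threshold moves toward 1 as well.
print(f"k = 4.000  k* = {apparent_threshold(4.0, delta):.3f}")
```

Comparing r*(x) with the unadjusted k when k differs from 1 therefore uses a threshold that is too extreme relative to the shrunken odds.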

4.2 Arbitrariness in the Class Definition 

The classical supervised classification paradigm 
also takes as fundamental the fact that the classes 
are well defined. That is, that there is some fixed 
clear external criterion which is used to produce the 
class labels. In many situations, however, this is not 
the case. In particular, when the classes are defined 
by thresholding a continuous variable, then there 
is always the possibility that the defining threshold 
might be changed. Once again, this situation arises 
in consumer credit, where it is common to define a 
customer as "defaulting" if they fall three months in 
arrears with repayments. This definition, however, 
is not a qualitative one (contrast has a tumor/does 
not have a tumor) but is very much a quantitative 
one. It is entirely reasonable that alternative defini- 
tions (e.g., four months in arrears) might be more 
useful if economic conditions were to change. This 
is a simple example, but in many situations much 
more complex class definitions based on logical com- 
binations of numerical attributes, split at fairly ar- 
bitrary thresholds, are used. For example, student 
grades are often based on levels of performance in 
continuous assessment and examinations. In detect- 
ing vertebral deformities in studies of osteoporosis, 
the ranges of the anterior, posterior and mid heights 
of the vertebra, as well as functions of these, such as 
ratios, are combined in quite complicated Boolean 
conditions to provide the definition (e.g., [10]). Def- 
initions formed in this sort of way are particularly 
common in situations that involve customer man- 
agement. For example, Lewis [31] defined a good 
account in a revolving credit operation (such as a
credit card) as someone whose billing account shows
(a) on the books for a minimum of 10 months, (b) activity
in 6 of the most recent 10 months, (c) purchases
of more than $50 in at least 3 of the past 24
months and (d) not more than once 30 days delinquent
in the past 24 months. A bad account is defined
as (a) delinquent for 90 days or more at any
time with an outstanding undisputed balance of $50
or more, (b) delinquent three times for 60 days in the 
past 12 months with an outstanding undisputed bal- 
ance on each occasion of $50 or more or (c) bankrupt 
while the account was open. Li and Hand [32] gave 
an even more complicated example from retail bank- 
ing. 

Our concern with these complicated definitions is 
that they are fairly arbitrary: the thresholds used 
to partition the various continua are not natural 
thresholds, but are imposed by humans. It is entirely 
possible that, retrospectively, one might decide that 
other thresholds would have been better. Ideally, un- 
der such circumstances, one would go back to the 
design data, redefine the classes and recompute the 
classification rule. However, this requires that the 
raw data have been retained at the level of the un- 
derlying continua used in the definitions. This is of- 
ten not the case. The term concept drift is some- 
times used to describe changes to the definitions of 
the classes. See, for example, the special issue of Ma- 
chine Learning (1998, Vol. 32, No. 2), Widmer and 
Kubat [46] and Lane and Brodley [30]. The prob- 
lem of changing class definitions has been examined 
in [25, 26] and [28]. 
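
A minimal sketch of a thresholded class definition shows why the raw continua must be retained if the classes are to be redefined later. The arrears values are invented and `is_default` is a hypothetical helper; only the three-month and four-month thresholds come from the example in the text.

```python
def is_default(months_in_arrears, threshold=3):
    """Label a customer "bad" if arrears reach the threshold (in months)."""
    return months_in_arrears >= threshold


# Hypothetical raw data: worst arrears (in months) reached by each customer.
worst_arrears = [0, 0, 1, 2, 3, 3, 4, 6]

labels_3 = [is_default(m, threshold=3) for m in worst_arrears]
labels_4 = [is_default(m, threshold=4) for m in worst_arrears]

# Customers whose class changes when the definition moves from 3 to 4 months.
changed = [m for m, a, b in zip(worst_arrears, labels_3, labels_4) if a != b]

# If only labels_3 had been stored, labels_4 could not be recovered:
# customers at exactly 3 months are indistinguishable from those at 4 or 6.
```

The more complicated Boolean definitions quoted above behave in the same way, with one such threshold for each component of the definition.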

If the very definitions of the classes may change 
between designing the classification rule and apply- 
ing it, then clearly there is little point in developing 
an overrefined model for the class definition which is 
no longer appropriate. Such models fail to take into 
account all sources of uncertainty in the problem. Of 
course, this does not necessarily imply that simple 
models will yield better classification results: this 
will depend on the nature of the difference between 
the design and application class definitions. How- 
ever, there are similarities to the overfitting issue. 
Overfitting arises when a complicated model faith- 
fully reflects aspects of the design data to the extent 
that idiosyncrasies of that data, rather than merely 
of the distribution from which the data arose, are 
included in the model. Then simple models, which 
fit the design data less well, lead to superior clas- 
sification. Likewise, in the present context, a model 
optimized on the design data class definition is re- 
flecting idiosyncrasies of the design data which may 
not occur in application data, not because of random 
variation, but because of the different definitions of 
the classes. Thus it is possible that models which 
fit the design data less well will do better in future 
classification tasks. 



The possibility of arbitrariness in the class defini- 
tion discussed in this section is quite distinct from 
the possibility of class priors or relative misclassifi- 
cation costs being changed — referred to in the quote 
from Provost and Fawcett [36] above — but the pos- 
sibility of these changes, also, casts doubt on the 
wisdom of modeling the problem too precisely, that 
is, of using models which are too sophisticated. 

4.3 Optimization Criteria and Performance 
Assessment 

When fitting a model to a design set, one op- 
timizes some criterion of goodness of fit (perhaps 
modified by a penalization term to avoid overfit- 
ting) or of classification performance. Many such 
measures are in use, including likelihood, misclassification
rate, cost-weighted misclassification rate,
Brier score, log score and area under the ROC curve.
Unfortunately, it is not difficult to contrive data 
sets for which different optimization criteria lead to 
(e.g.) linear decision surfaces with very different ori- 
entations (even to the extent of being orthogonal). 
Benton [[2], Chap. 4] illustrated this for several real 
data sets. Clearly, then, it is important to specify
the criterion to be used when building a classifica- 
tion rule. If the use to which the model will be put is 
well specified to the extent that a measure of perfor- 
mance can be precisely defined, then this measure 
should determine the criterion of goodness of fit. 
All too often, however, there is a mismatch between 
the criterion used to choose the model, the criterion 
used to evaluate its performance, and the criterion 
which actually matters in real application. For ex- 
ample, a common approach might be to use like- 
lihood to estimate a model's parameters, use mis- 
classification rate to assess its performance and use 
some cost-weighted misclassification rate in practice 
(e.g., some combination of specificity and sensitiv- 
ity). In circumstances such as these, it would clearly 
be pointless to refine the model to a high degree of 
accuracy from a likelihood perspective, when this 
may be only weakly related to the real performance 
objective. 
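
That different criteria can rank models differently is easily shown on a contrived example. The sketch below uses four invented cases and two invented sets of predicted probabilities; the 0.5 decision threshold and the function names are assumptions made for illustration. Model A has the lower misclassification rate, while model B has the better Brier score and log score.

```python
from math import log


def error_rate(y, p, threshold=0.5):
    """Proportion misclassified when predicting class 1 if p >= threshold."""
    return sum((pi >= threshold) != (yi == 1) for yi, pi in zip(y, p)) / len(y)


def brier_score(y, p):
    """Mean squared difference between predicted probability and outcome."""
    return sum((pi - yi) ** 2 for yi, pi in zip(y, p)) / len(y)


def log_score(y, p):
    """Mean negative log-likelihood of the observed classes."""
    return -sum(log(pi if yi == 1 else 1 - pi) for yi, pi in zip(y, p)) / len(y)


# Hypothetical true classes and predicted probabilities of class 1.
y = [1, 1, 0, 0]
model_a = [0.55, 0.55, 0.45, 0.45]  # always on the right side, never confident
model_b = [0.99, 0.99, 0.01, 0.60]  # confident and right, except for one case

scores = {
    name: (error_rate(y, p), brier_score(y, p), log_score(y, p))
    for name, p in (("A", model_a), ("B", model_b))
}
for name, (err, brier, logs) in scores.items():
    print(f"model {name}: error = {err:.3f}, Brier = {brier:.4f}, log = {logs:.4f}")
```

Which model is "better" here depends entirely on the criterion, so refining a model under one criterion gives no assurance about its standing under another.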

Having said that, one must acknowledge that of- 
ten precise details of how performance is to be mea- 
sured in the future cannot be given. For example, 
in most applications it is difficult to give more than 
general statements about the relative costs of differ- 
ent kinds of misclassifications. In such cases it might 
be worthwhile to choose a criterion that is equivalent
to averaging over a range of possible costs: likelihood,
the area under a receiver operating characteristic
curve and the weighted version of the latter
described in [1] can all be regarded as attempts to 
do that. 
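
One way to see that the area under the ROC curve does not commit to a single cost ratio is that it depends only on how the scores rank the cases, not on any threshold. The sketch below computes it as the proportion of (class 1, class 0) pairs ranked correctly; the data and the function name `auc` are illustrative assumptions, and ties are counted as one half by convention.

```python
def auc(y, scores):
    """Proportion of (class 1, class 0) pairs in which the class 1 case
    receives the higher score; ties count one half."""
    pos = [s for yi, s in zip(y, scores) if yi == 1]
    neg = [s for yi, s in zip(y, scores) if yi == 0]
    wins = sum((sp > sn) + 0.5 * (sp == sn) for sp in pos for sn in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical classes and scores.
y = [1, 1, 0, 0]
scores = [0.9, 0.4, 0.6, 0.1]

auc_raw = auc(y, scores)
# A monotone transformation of the scores leaves the ranking, and so the AUC, unchanged.
auc_squared = auc(y, [s ** 2 for s in scores])
```

Because no threshold enters the calculation, the measure cannot favor a rule tuned to one particular cost ratio over another.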

5. INTERPRETING EMPIRICAL 
COMPARISONS 

There have been a great many empirical comparisons
of the performance of different kinds of classification
rules. Some of these are in the context of
a new method having been developed and the ef- 
fort to gain some understanding of how it performs 
relative to existing methods. Other comparisons are 
purely comparative studies, seeking to make disin- 
terested comparative statements about the relative 
merits of different methods. At first glance, such 
comparative studies are useful in shedding light on 
the different methods, on which generally yield su- 
perior performance or on which are to be preferred 
for particular kinds of data or in particular domains. 
However, on closer examination, such comparisons 
have major weaknesses and can even be seriously 
misleading. Various authors have drawn attention 
to these problems, including Duin [4], Salzberg [40], 
Hand [13], Hoadley [21] and Efron [5], so we will 
only briefly mention some of the main points here; 
in particular, only those points relating to classification
accuracy, rather than other aspects of performance.
Jamain and Hand [24] also gave a more
detailed review of comparative studies of classifica- 
tion rules. 

Different categories of users might be expected to 
obtain different rankings of classification methods 
in comparative studies. For example, we can con- 
trast an expert user, who will be able to fine-tune 
methods, with an inexperienced user, perhaps some- 
one who has simply pulled some standard public- 
domain software from the web. It would probably 
be surprising if their rankings did not differ. More- 
over, experts will tend to have particular expertise 
with particular classes of method. Someone expert 
in neural networks may well achieve superior re- 
sults with those methods than with support vec- 
tor machines and vice versa. Taken to an extreme, 
of course, many comparative studies are made to 
establish the performance and properties of newly 
invented methods — by their inventors. One might 
expect substantial bias in favor of the new methods,
compared to what others might be able to achieve,
in such studies. Duin [4] pointed out the difficulty 
of comparing, "in a fair and objective way," classi- 
fiers which require substantial input of expertise (so 
that domain knowledge can be taken advantage of) 
and classifiers which can be applied automatically 
with little external input of expertise. The two ex- 
tremes (of what is really a continuum, of course) are 
appropriate in different circumstances. 

The principle of comparing methods by applying 
them to a collection of disparate real data sets is use- 
ful, but has its weaknesses. An obvious one is that 
different studies use different collections of data sets, 
so making comparisons difficult. Furthermore, the 
collection will not be representative of real data sets 
in any formal sense. Moreover, a potential user is 
not really interested in some "average performance" 
over distinct types of data, but really wants to know 
what will be good for his or her problem, and differ- 
ent people have different problems, with data aris- 
ing from different domains. A given method may be 
very poor on most kinds of data, but very good for 
certain problems. 

The widespread use of standard collections of data 
sets (such as the UCI repository [35]) has clear mer- 
its: new methods can be compared with earlier ones 
on a level playing field. However, this also means 
that there will be some overfitting both to the in- 
dividual data sets in the collection and to the col- 
lection as a whole. That is, some methods will do 
well on data sets in the collection purely by chance. 
Indeed, the more successful the collection is in the 
sense that more and more people use it for compara- 
tive assessments, the more serious this problem will 
become. 
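
The chance element can be quantified under strong simplifying assumptions. The sketch below supposes that every method has the same true accuracy and that the methods' results are independent (unrealistic when they share a test set, but adequate for illustration), and computes the probability that at least one of m methods appears to exceed a higher accuracy level on a test set of n cases. All numbers and function names are hypothetical.

```python
from math import comb


def prob_at_least(n, p, t):
    """P(X >= t) for X ~ Binomial(n, p)."""
    return sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(t, n + 1))


def prob_some_method_looks_good(m, n, p, t):
    """P(at least one of m independent methods gets >= t of n test cases right)."""
    return 1 - (1 - prob_at_least(n, p, t)) ** m


# Illustrative numbers (assumptions): every method has true accuracy 0.80,
# the test set has 200 cases, and "looks good" means 85% or more correct.
n, p, t = 200, 0.80, 170
for m in (1, 10, 100):
    print(m, round(prob_some_method_looks_good(m, n, p, t), 3))
```

The probability rises steadily with the number of methods tried, which is the sense in which heavier use of a standard collection makes the problem worse.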

Jamain and Hand [24] pointed out the difficulty 
of saying exactly what a classification "method" is. 
Is a neural network with a single hidden node to be 
regarded as from the same family as one with an 
arbitrary number of hidden nodes? It is clearly not 
exactly the same method. Comparative evaluations 
using the two models may well yield very different 
classification results. It is this sort of phenomenon 
which explains why the comparative performance 
literature contains many different results for "the 
same" methods applied to given public data sets. 
Can one then draw general conclusions about the 
effectiveness of the method of neural networks? Fur- 
thermore, to what extent is preprocessing the data 
to be regarded as part of the method? Linear dis- 
criminant analysis on raw data may yield very dif- 
ferent results from the same model applied to data 
which has been processed to remove skewness. Is,
then, linear discriminant analysis good or bad on
these data? Likewise, is a data set in which miss- 
ing values have been replaced by imputed values the 
same as a data set in which incomplete records have 
been dropped? Applying the same method to the 
two variants of the data is likely to yield different 
results. 

We have already commented that the "accuracy" 
of a classification rule can be measured in a wide va- 
riety of ways, and that different measures are likely 
to yield different performance rankings of classifiers. 

Given all of the above points, it is not surprising 
that different authors have drawn different conclu- 
sions about the relative accuracy of different clas- 
sifiers. Other commentators have taken things even 
further. In the discussion that accompanies [3], Efron 
suggested that new methods always look better than 
older ones and that complicated methods are harder 
to criticize than simpler ones. He also noted that 
it is difficult to make fair comparisons by making 
the same effort in applying different methods — a 
point made above. Hoadley, in the same discussion, 
"coined a phrase called the 'ping-pong theorem.' 
This theorem says that if we revealed to Profes- 
sor Breiman the performance of our best model and 
gave him our data, then he could develop an algo- 
rithmic model using random forests, which would 
outperform our model. But if he revealed to us the 
performance of his model, then we could develop a 
segmented scorecard, which would outperform his 
model." 

With so many difficulties in ranking and compar- 
ing classifiers, one might naturally have reservations 
about small differences in performance — of the kind 
generally asserted for the more complicated and so- 
phisticated methods over the older and simpler mod- 
els. 

6. CONCLUSION 

In Section 2 we demonstrated that, when build- 
ing predictive models of increasing complexity, the 
marginal gain from complicated models is typically 
small compared to the predictive power of the sim- 
ple models. In many cases, the simple models ac- 
counted for over 90% of the predictive power that 
could be achieved by "the best" model we could find. 
Now, in the idealized classical supervised classifica- 
tion paradigm, certain assumptions are implicit: it is 
assumed that the distributions from which the design
points and the new points are drawn are the
same, that the classes are well defined and the definitions
will not change, that the costs of different
kinds of misclassification are known accurately, and 
so on. In real applications, however, these additional 
assumptions will often not hold. This means that 
apparent small (laboratory) gains in performance 
might not be realized in practice — they may well be 
swamped by uncertainties arising from mismatches 
between the apparent problem and the real problem. 
In particular, many of the comparative studies in 
the literature are based on brief descriptions of data 
sets, containing no background information at all 
on such possible additional sources of variation due 
to breakdown of implicit assumptions of the kind 
illustrated above. This must cast doubt on the va- 
lidity of their conclusions. In general, it means that 
deeper critical assessment of the context of the prob- 
lem and data should be made if useful practical con- 
clusions are to be drawn. If enough is known about 
likely additional sources of variability, beyond the 
classical sources of sampling variability and model 
uncertainty, then more sophisticated models can be 
built. However, if insufficient information is known 
about these additional sources, which we speculate 
will very often be the case, then the principle of par- 
simony suggests that it is better to stick to simple 
models. 

We should note, parenthetically, that there are 
also other reasons to favor simple models. Inter- 
pretability, in particular, is often an important re- 
quirement of a classification rule. Indeed, sometimes 
it is even a legal requirement (e.g., in credit scoring). 
This leads us to the observation that what one re- 
gards as "simple" may vary from user to user: some 
might favor weighted sums of predictor values, oth- 
ers might prefer (small) tree structures and yet oth- 
ers might regard nearest neighbor methods as being 
simple. 

Perhaps it is appropriate to conclude with the 
comment that, by arguing that simple models are 
often more appropriate than complex ones and that 
the claims of superior performance of the more com- 
plex models may be misleading, I am not suggest- 
ing that no major advances in classification methods 
will ever be made. Such a claim would be absurd in 
the face of developments such as the bootstrap and 
other resampling approaches, which have led to sig- 
nificant advances in classification and other statis- 
tical models. All I am saying is that much of the 
purported advance may well be illusory. Further- 
more, although (almost by definition) one cannot 
predict where the next step-change will come from,
one might venture a guess as to its general area. Re- 
sampling methods are children of the computer revo- 
lution, as indeed are most other recent developments 
in classifier technology [e.g., classification trees, neu- 
ral networks, support vector machines, random forests, 
multivariate adaptive regression splines (MARS) and 
practical Bayesian methods]. Since progress in computer
hardware is continuing, one might reasonably
expect that the advances will arise from more pow- 
erful data storage and processing ability. 